Eeva Vakkari
https://github.com/EevaVakkari/IODS-project
date()
## [1] "Mon Nov 27 15:55:03 2023"
It has been quite a heavy crash course into the world of R. The book is good, especially with the exercises and notes in the markdown file, but it’s massive for a beginner.
With the R syntax, I still feel almost as if I were illiterate, but I trust I’ll get used to the R jargon with time and more exercises.
After this course, I hope to be able to work somewhat independently in R, at least with very basic things. I aim to familiarize myself with it at least to the level where I can understand the error messages I get and the solutions I google.
This week, I studied linear regression using survey data combined with exam results from a university statistics course. I made a simple linear model and validated it graphically. The first two code blocks contain the data wrangling; the analysis part starts after them.
date()
## [1] "Mon Nov 27 15:55:04 2023"
I performed the data wrangling as instructed:
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.2
## Warning: package 'readr' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
lrn14 <- read.table("http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS3-data.txt", sep="\t", header=TRUE)
lrn14$attitude <- lrn14$Attitude / 10 # scale the combined attitude score back to the 1-5 Likert scale
deep_questions <- c("D03", "D11", "D19", "D27", "D07", "D14", "D22", "D30","D06", "D15", "D23", "D31")
lrn14$deep <- rowMeans(lrn14[, deep_questions])
surface_questions <- c("SU02","SU10","SU18","SU26", "SU05","SU13","SU21","SU29","SU08","SU16","SU24","SU32")
lrn14$surf <- rowMeans(lrn14[, surface_questions])
strategic_questions <- c("ST01","ST09","ST17","ST25","ST04","ST12","ST20","ST28")
lrn14$stra <- rowMeans(lrn14[, strategic_questions])
str(lrn14)
## 'data.frame': 183 obs. of 64 variables:
## $ Aa : int 3 2 4 4 3 4 4 3 2 3 ...
## $ Ab : int 1 2 1 2 2 2 1 1 1 2 ...
## $ Ac : int 2 2 1 3 2 1 2 2 2 1 ...
## $ Ad : int 1 2 1 2 1 1 2 1 1 1 ...
## $ Ae : int 1 1 1 1 2 1 1 1 1 1 ...
## $ Af : int 1 1 1 1 1 1 1 1 1 2 ...
## $ ST01 : int 4 4 3 3 4 4 5 4 4 4 ...
## $ SU02 : int 2 2 1 3 2 3 2 2 1 2 ...
## $ D03 : int 4 4 4 4 5 5 4 4 5 4 ...
## $ ST04 : int 4 4 4 4 3 4 2 5 5 4 ...
## $ SU05 : int 2 4 2 3 4 3 2 4 2 4 ...
## $ D06 : int 4 2 3 4 4 5 3 3 4 4 ...
## $ D07 : int 4 3 4 4 4 5 4 4 5 4 ...
## $ SU08 : int 3 4 1 2 3 4 4 2 4 2 ...
## $ ST09 : int 3 4 3 3 4 4 2 4 4 4 ...
## $ SU10 : int 2 1 1 1 2 1 1 2 1 2 ...
## $ D11 : int 3 4 4 3 4 5 5 3 4 4 ...
## $ ST12 : int 3 1 4 3 2 3 2 4 4 4 ...
## $ SU13 : int 3 3 2 2 3 1 1 2 1 2 ...
## $ D14 : int 4 2 4 4 4 5 5 4 4 4 ...
## $ D15 : int 3 3 2 3 3 4 2 2 3 4 ...
## $ SU16 : int 2 4 3 2 3 2 3 3 4 4 ...
## $ ST17 : int 3 4 3 3 4 3 4 3 4 4 ...
## $ SU18 : int 2 2 1 1 1 2 1 2 1 2 ...
## $ D19 : int 4 3 4 3 4 4 4 4 5 4 ...
## $ ST20 : int 2 1 3 3 3 3 1 4 4 2 ...
## $ SU21 : int 3 2 2 3 2 4 1 3 2 4 ...
## $ D22 : int 3 2 4 3 3 5 4 2 4 4 ...
## $ D23 : int 2 3 3 3 3 4 3 2 4 4 ...
## $ SU24 : int 2 4 3 2 4 2 2 4 2 4 ...
## $ ST25 : int 4 2 4 3 4 4 1 4 4 4 ...
## $ SU26 : int 4 4 4 2 3 2 1 4 4 4 ...
## $ D27 : int 4 2 3 3 3 5 4 4 5 4 ...
## $ ST28 : int 4 2 5 3 5 4 1 4 5 2 ...
## $ SU29 : int 3 3 2 3 3 2 1 2 1 2 ...
## $ D30 : int 4 3 4 4 3 5 4 3 4 4 ...
## $ D31 : int 4 4 3 4 4 5 4 4 5 4 ...
## $ SU32 : int 3 5 5 3 4 3 4 4 3 4 ...
## $ Ca : int 2 4 3 3 2 3 4 2 3 2 ...
## $ Cb : int 4 4 5 4 4 5 5 4 5 4 ...
## $ Cc : int 3 4 4 4 4 4 4 4 4 4 ...
## $ Cd : int 4 5 4 4 3 4 4 5 5 5 ...
## $ Ce : int 3 5 3 3 3 3 4 3 3 4 ...
## $ Cf : int 2 3 4 4 3 4 5 3 3 4 ...
## $ Cg : int 3 2 4 4 4 5 5 3 5 4 ...
## $ Ch : int 4 4 2 3 4 4 3 3 5 4 ...
## $ Da : int 3 4 1 2 3 3 2 2 4 1 ...
## $ Db : int 4 3 4 4 4 5 4 4 2 4 ...
## $ Dc : int 4 3 4 5 4 4 4 4 4 4 ...
## $ Dd : int 5 4 1 2 4 4 5 3 5 2 ...
## $ De : int 4 3 4 5 4 4 5 4 4 2 ...
## $ Df : int 2 2 1 1 2 3 1 1 4 1 ...
## $ Dg : int 4 3 3 5 5 4 4 4 5 1 ...
## $ Dh : int 3 3 1 4 5 3 4 1 4 1 ...
## $ Di : int 4 2 1 2 3 3 2 1 4 1 ...
## $ Dj : int 4 4 5 5 3 5 4 5 2 4 ...
## $ Age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ Attitude: int 37 31 25 35 37 38 35 29 38 21 ...
## $ Points : int 25 12 24 10 22 21 21 31 24 26 ...
## $ gender : chr "F" "M" "F" "M" ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
dim(lrn14)
## [1] 183 64
At this point, the data set consists of 63 numerical variables and one character variable, with 183 observations in total. Now, let’s focus on a subset of the data:
library(dplyr)
keep_columns <- c("gender","Age","Attitude", "deep", "stra", "surf", "Points")
analysis <- select(lrn14, one_of(keep_columns)) # keep only the chosen columns
analysis <- filter_if(analysis, is.numeric, all_vars((.) != 0)) # remove observations where any numeric variable (notably exam points) is zero
setwd("~/Polyembryony_R/IODS/IODS-project/data")
write.csv(analysis, "learning2014.csv", row.names = F)
a <- read.csv("learning2014.csv") #Checking that csv is OK.
str(a)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : chr "F" "M" "F" "M" ...
## $ Age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ Attitude: int 37 31 25 35 37 38 35 29 38 21 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ Points : int 25 12 24 10 22 21 21 31 24 26 ...
head(a)
## gender Age Attitude deep stra surf Points
## 1 F 53 37 3.583333 3.375 2.583333 25
## 2 M 55 31 2.916667 2.750 3.166667 12
## 3 F 49 25 3.500000 3.625 2.250000 24
## 4 M 53 35 3.500000 3.125 2.250000 10
## 5 M 49 37 3.666667 3.625 2.833333 22
## 6 F 38 38 4.750000 3.625 2.416667 21
Looking good. Continuing to the analysis part of the assignment.
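As an aside, `one_of()` and `filter_if()` used above are superseded in newer dplyr versions; an equivalent subset could be sketched like this (not run here, assuming dplyr >= 1.0):

```r
library(dplyr)

# Same column selection and zero-filtering with current dplyr idioms:
analysis <- lrn14 %>%
  select(all_of(keep_columns)) %>%                 # all_of() replaces one_of()
  filter(if_all(where(is.numeric), ~ .x != 0))     # if_all() replaces filter_if()
```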
Setting the working directory and reading the csv into a data frame:
setwd("~/Polyembryony_R/IODS/IODS-project/data")
learning2014 <- read.csv("learning2014.csv")
str(learning2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : chr "F" "M" "F" "M" ...
## $ Age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ Attitude: int 37 31 25 35 37 38 35 29 38 21 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ Points : int 25 12 24 10 22 21 21 31 24 26 ...
head(learning2014)
## gender Age Attitude deep stra surf Points
## 1 F 53 37 3.583333 3.375 2.583333 25
## 2 M 55 31 2.916667 2.750 3.166667 12
## 3 F 49 25 3.500000 3.625 2.250000 24
## 4 M 53 35 3.500000 3.125 2.250000 10
## 5 M 49 37 3.666667 3.625 2.833333 22
## 6 F 38 38 4.750000 3.625 2.416667 21
The data consists of learning-approach survey results combined with students’ exam performance. There are 7 variables and 166 observations.
pairs(learning2014[-1])
library(GGally)
## Warning: package 'GGally' was built under R version 4.3.2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggplot2)
p <- ggpairs(learning2014, mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)))
p
Exam success, as represented by “Points”, seems to correlate with attitude. Other plots do not suggest strong correlations between the variables. Let’s study attitude, age and gender as explanatory variables for exam points.
my_model1 <- lm(Points ~ Attitude + Age + gender, data = learning2014)
summary(my_model1)
##
## Call:
## lm(formula = Points ~ Attitude + Age + gender, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.4590 -3.3221 0.2186 4.0247 10.4632
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.42910 2.29043 5.863 2.48e-08 ***
## Attitude 0.36066 0.05932 6.080 8.34e-09 ***
## Age -0.07586 0.05367 -1.414 0.159
## genderM -0.33054 0.91934 -0.360 0.720
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.315 on 162 degrees of freedom
## Multiple R-squared: 0.2018, Adjusted R-squared: 0.187
## F-statistic: 13.65 on 3 and 162 DF, p-value: 5.536e-08
According to the two-sided t-tests of the coefficients, only attitude has a statistically significant effect on exam points, whereas age and gender do not explain exam success. Attitude correlates positively with exam points: the better the attitude, the higher the points tend to be. Next, I make a new model with attitude as the only explanatory variable.
my_model2 <- lm(Points ~ Attitude, data = learning2014)
summary(my_model2)
##
## Call:
## lm(formula = Points ~ Attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.63715 1.83035 6.358 1.95e-09 ***
## Attitude 0.35255 0.05674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
The second model also shows that attitude has a statistically significant, positive association with exam points. Removing age and gender from the model drops the adjusted R-squared by only 0.0014, indicating a negligible loss in the explanatory power of the model. Next, I draw diagnostic plots for the second model.
plot(my_model2, which = c(1, 2, 5))
A linear regression model rests on four assumptions:

1. There is a linear relationship between the predictor(s) and the outcome.
2. The residuals are independent.
3. The residuals are normally distributed.
4. The residuals have equal variance.

As demonstrated above, attitude and exam points have a linear relationship. We can reasonably assume the residuals are independent, since the survey represents individual students at a single time point, so autocorrelation is not a concern. The residuals vs. fitted values plot shows no clear pattern and a roughly constant spread, supporting the linearity and equal-variance assumptions. The Q-Q plot shows that the residuals are not perfectly normally distributed, but they come quite close. The residuals vs. leverage plot shows that no data point comes close to Cook’s distance, i.e. none has high leverage, so there are no overly influential single observations.
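The graphical checks could be complemented with numerical ones, for example a Shapiro-Wilk test of residual normality and a Breusch-Pagan test of constant variance (a sketch; the latter assumes the lmtest package is installed):

```r
# Numerical checks of the residual assumptions for the fitted model:
shapiro.test(residuals(my_model2))  # H0: residuals are normally distributed

library(lmtest)                     # assumed to be installed
bptest(my_model2)                   # H0: residuals have constant variance
```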
This week we practice logistic regression with survey data combining students’ alcohol consumption and grades, together with demographic, social, and school-related features. The data was downloaded from the UCI Machine Learning Repository: http://www.archive.ics.uci.edu/dataset/320/student+performance
date()
## [1] "Mon Nov 27 15:55:07 2023"
library(tidyverse)
library(dplyr)
library(readr)
Reading joined and modified data in and checking out the column names:
setwd("~/Polyembryony_R/IODS/IODS-project/data")
alc <- read_csv("~/Polyembryony_R/IODS/IODS-project/data/alc.csv")
## New names:
## Rows: 370 Columns: 36
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (17): school, sex, address, famsize, Pstatus, Mjob, Fjob, reason, guardi...
## dbl (18): ...1, age, Medu, Fedu, traveltime, studytime, famrel, freetime, go...
## lgl  (1): high_use
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
head(alc)
## # A tibble: 6 × 36
## ...1 school sex age address famsize Pstatus Medu Fedu Mjob Fjob
## <dbl> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr>
## 1 1 GP F 18 U GT3 A 4 4 at_home teacher
## 2 2 GP F 17 U GT3 T 1 1 at_home other
## 3 3 GP F 15 U LE3 T 1 1 at_home other
## 4 4 GP F 15 U GT3 T 4 2 health services
## 5 5 GP F 16 U GT3 T 3 3 other other
## 6 6 GP M 16 U LE3 T 4 3 services other
## # ℹ 25 more variables: reason <chr>, guardian <chr>, traveltime <dbl>,
## # studytime <dbl>, schoolsup <chr>, famsup <chr>, activities <chr>,
## # nursery <chr>, higher <chr>, internet <chr>, romantic <chr>, famrel <dbl>,
## # freetime <dbl>, goout <dbl>, Dalc <dbl>, Walc <dbl>, health <dbl>,
## # failures <dbl>, paid <chr>, absences <dbl>, G1 <dbl>, G2 <dbl>, G3 <dbl>,
## # alc_use <dbl>, high_use <lgl>
I study sex, age, absences, and home address (urban/rural) as explanatory factors for a student’s alcohol consumption. My hypothesis is that younger female students with more absences, living in rural locations, consume the most alcohol. I assume all the variables have equal, additive effects on the response variable. Let’s explore the data:
g1 <- ggplot(data = alc, aes(x = alc_use))
g1 + geom_bar()
It seems that the majority use a relatively low amount of alcohol.
g2 <- ggplot(alc, aes(x = sex, y = alc_use, fill=sex))+
geom_boxplot()
g2
Surprisingly, this suggests that my hypothesis about females consuming more alcohol may not be correct. I’ll study this in more detail.
g3 <- ggplot(alc, aes(x = age, y = alc_use))+
geom_boxplot()
g3
## Warning: Continuous x aesthetic
## ℹ did you forget `aes(group = ...)`?
Not a very informative plot type here, since age is continuous.
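One way to make this plot type work is to treat age as a discrete grouping variable (a quick sketch):

```r
# Boxplot of alcohol use by age, with age treated as a factor
# so each age gets its own box:
g3b <- ggplot(alc, aes(x = factor(age), y = alc_use)) +
  geom_boxplot()
g3b
```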
g4 <- ggplot(alc, aes(x = address, y = alc_use, fill=address))+
geom_boxplot()
g4
My hypothesis that rural living is associated with high alcohol consumption might well be correct!
g5 <- ggplot(alc, aes(x = alc_use, y = absences, color=absences))+
geom_point()
g5
Alcohol consumption appears quite scattered when plotted against absences; there is no clear support for my hypothesis here.
Logistic regression:
m <- glm(high_use ~ age + sex + address + absences, data = alc, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ age + sex + address + absences, family = "binomial",
## data = alc)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.71552 1.77325 -2.095 0.0361 *
## age 0.13172 0.10325 1.276 0.2020
## sexM 1.03363 0.24529 4.214 2.51e-05 ***
## addressU -0.40391 0.28314 -1.427 0.1537
## absences 0.09333 0.02356 3.962 7.45e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 411.27 on 365 degrees of freedom
## AIC: 421.27
##
## Number of Fisher Scoring iterations: 4
coefficients(m)
## (Intercept) age sexM addressU absences
## -3.7155227 0.1317217 1.0336294 -0.4039128 0.0933257
So, it seems that age and address are statistically insignificant, whereas male sex and a high number of absences have highly significant effects on alcohol consumption. Checking the odds ratios and confidence intervals:
OR <- coef(m) %>% exp
CI <- confint(m)
## Waiting for profiling to be done...
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.02434271 -7.24462192 -0.2732444
## age 1.14079079 -0.06972570 0.3361414
## sexM 2.81125045 0.55900936 1.5225186
## addressU 0.66770237 -0.95558085 0.1574767
## absences 1.09781924 0.04917901 0.1416402
An odds ratio > 1 (age, male sex, absences) indicates a positive association with high alcohol use, while the odds ratio < 1 for urban address indicates a negative association. Note that confint() returns intervals on the log-odds scale, so they should be exponentiated before being compared with the odds ratios; on either scale, the intervals for age and address include the null value (0 on the log-odds scale, 1 on the odds-ratio scale), so those effects are statistically insignificant, whereas male sex and absences are significant. Making a new model with only the statistically significant sex and absences included:
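Since confint() works on the log-odds scale, a version of the table with both columns on the odds-ratio scale could be sketched like this:

```r
# Exponentiate both the coefficients and their confidence intervals,
# so that an interval excluding 1 indicates a significant odds ratio:
OR_table <- exp(cbind(OR = coef(m), confint(m)))
OR_table
```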
m2 <- glm(high_use ~ absences + sex, data = alc, family = "binomial")
probabilities <- predict(m2, type = "response")
alc <- mutate(alc, probability = probabilities)
alc <- mutate(alc, prediction = probabilities) # note: this should be probability > 0.5 to yield TRUE/FALSE class predictions
select(alc, absences, sex, high_use, probability, prediction) %>% tail(10)
## # A tibble: 10 × 5
## absences sex high_use probability prediction
## <dbl> <chr> <lgl> <dbl> <dbl>
## 1 3 M FALSE 0.371 0.371
## 2 0 M FALSE 0.306 0.306
## 3 7 M TRUE 0.465 0.465
## 4 1 F FALSE 0.147 0.147
## 5 6 F FALSE 0.219 0.219
## 6 2 F FALSE 0.160 0.160
## 7 2 F FALSE 0.160 0.160
## 8 3 F FALSE 0.173 0.173
## 9 4 M TRUE 0.394 0.394
## 10 2 M TRUE 0.349 0.349
table(high_use = alc$high_use, prediction = probabilities) # tabulates the raw probabilities, not class predictions
## prediction
## high_use 0.13551762779056 0.147251609825878 0.159813770121884 0.173229912920424
## FALSE 28 17 30 15
## TRUE 6 6 6 4
## prediction
## high_use 0.187520951046583 0.202701911186317 0.218780924782799
## FALSE 11 10 9
## TRUE 2 3 3
## prediction
## high_use 0.235758237182898 0.253625273514685 0.272363804567288
## FALSE 7 9 4
## TRUE 0 1 2
## prediction
## high_use 0.291945259033951 0.30607905167818 0.312330229220817 0.32699514282328
## FALSE 3 22 1 20
## TRUE 0 7 0 7
## prediction
## high_use 0.333468215141401 0.348622467320154 0.355297646392281
## FALSE 3 10 1
## TRUE 0 10 0
## prediction
## high_use 0.370891910681723 0.377746212171274 0.39372404069047 0.417029993594045
## FALSE 16 1 13 6
## TRUE 3 2 9 3
## prediction
## high_use 0.424162067815099 0.440712681060035 0.464668291130248
## FALSE 0 7 2
## TRUE 1 2 3
## prediction
## high_use 0.471955510526924 0.488788039009662 0.49610297710717 0.512960107955271
## FALSE 1 5 0 1
## TRUE 0 5 1 4
## prediction
## high_use 0.520268637326094 0.537071708240416 0.544339820805449 0.56101117435482
## FALSE 2 2 1 1
## TRUE 0 2 1 3
## prediction
## high_use 0.584670018040823 0.607944857721631 0.630739153048458
## FALSE 0 0 0
## TRUE 4 1 4
## prediction
## high_use 0.680934752085533 0.695404986885368 0.715493971837525
## FALSE 0 0 0
## TRUE 1 1 1
## prediction
## high_use 0.721413863681047 0.844995611964618 0.916987057630397
## FALSE 0 0 0
## TRUE 1 1 1
## prediction
## high_use 0.924057950235558
## FALSE 1
## TRUE 0
g6 <- ggplot(alc, aes(x = high_use, y = probability))+
geom_boxplot()
g6
summary(m2)
##
## Call:
## glm(formula = high_use ~ absences + sex, family = "binomial",
## data = alc)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.85303 0.22609 -8.196 2.49e-16 ***
## absences 0.09671 0.02336 4.140 3.48e-05 ***
## sexM 1.03451 0.24395 4.241 2.23e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 415.66 on 367 degrees of freedom
## AIC: 421.66
##
## Number of Fisher Scoring iterations: 4
I’m not quite sure whether I got the cross tabulation right: the table above tabulates the raw predicted probabilities rather than TRUE/FALSE class predictions, because the prediction column was set to the probabilities themselves instead of probability > 0.5. The second model itself looks good statistics-wise. Trying to compute the average number of wrong predictions:
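A corrected cross tabulation would first threshold the probabilities into TRUE/FALSE predictions and then tabulate those against the observed values, e.g.:

```r
# Turn predicted probabilities into class predictions at the 0.5 cutoff,
# then build a 2x2 confusion matrix of observed vs. predicted high use:
alc <- mutate(alc, prediction = probability > 0.5)
table(high_use = alc$high_use, prediction = alc$prediction)
```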
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
loss_func(class = alc$high_use, prob = 0) # baseline: always predict FALSE (no high use)
## [1] 0.3
So the error of this simple baseline strategy, always predicting FALSE, is 0.3: it misclassifies the 30% of students who actually are high users. A useful model should beat this baseline, and certainly the roughly 0.5 error expected from random guessing.
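For comparison, the model’s own training error could be computed by passing its predicted probabilities to the same loss function:

```r
# Proportion of observations the model misclassifies at the 0.5 cutoff:
loss_func(class = alc$high_use, prob = alc$probability)
```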
This week, we use the Boston data set, which contains housing values and related variables for suburbs of Boston. The variables include socioeconomic, demographic, and geographic information. In this exercise, I study variables that potentially explain the crime rate, using clustering and linear discriminant analysis (LDA).
date()
## [1] "Mon Nov 27 15:55:17 2023"
Loading required packages:
library(MASS)
## Warning: package 'MASS' was built under R version 4.3.2
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.3.2
## corrplot 0.92 loaded
library(tidyverse)
library(dplyr)
library(readr)
Reading in the data. The data set is built into the MASS package, so it can be loaded like this:
data("Boston")
I have a look at the data:
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
The data set is now a data frame comprising 506 observations and 14 variables. All of the variables are numerical, either integer or numeric (decimal).
The variables:
- crim: per capita crime rate by town.
- zn: proportion of residential land zoned for lots over 25,000 sq. ft.
- indus: proportion of non-retail business acres per town.
- chas: Charles River dummy variable (= 1 if tract bounds the river; 0 otherwise).
- nox: nitrogen oxides concentration (parts per 10 million).
- rm: average number of rooms per dwelling.
- age: proportion of owner-occupied units built prior to 1940.
- dis: weighted mean of distances to five Boston employment centers.
- rad: index of accessibility to radial highways.
- tax: full-value property-tax rate per $10,000.
- ptratio: pupil-teacher ratio by town.
- black: the proportion of blacks by town.
- lstat: lower status of the population (percent).
- medv: median value of owner-occupied homes in $1000s.
Next, I’m creating a graphical overview of the data. I draw a plot matrix with pairs() and then a correlation plot with corrplot().
pairs(Boston)
This looks a bit messy. I could probably fix it by adding arguments to pairs(), but I’d rather try a different approach that gives me histograms to better study the distributions. I also try to include the Spearman correlation coefficients.
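For completeness, the corrplot() approach mentioned above could be sketched like this, using the already-loaded corrplot package:

```r
# Spearman correlation matrix of the Boston variables,
# visualized as an upper-triangle correlation plot:
cor_matrix <- cor(Boston, method = "spearman")
corrplot(cor_matrix, method = "circle", type = "upper", tl.cex = 0.7)
```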
library(PerformanceAnalytics)
## Warning: package 'PerformanceAnalytics' was built under R version 4.3.2
## Loading required package: xts
## Warning: package 'xts' was built under R version 4.3.2
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 4.3.2
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## ######################### Warning from 'xts' package ##########################
## # #
## # The dplyr lag() function breaks how base R's lag() function is supposed to #
## # work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or #
## # source() into this session won't work correctly. #
## # #
## # Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
## # conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop #
## # dplyr from breaking base R's lag() function. #
## # #
## # Code in packages is not affected. It's protected by R's namespace mechanism #
## # Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning. #
## # #
## ###############################################################################
##
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
##
## first, last
##
## Attaching package: 'PerformanceAnalytics'
## The following object is masked from 'package:graphics':
##
## legend
chart.Correlation(Boston, histogram = TRUE, method = "spearman")
## Warning in cor.test.default(as.numeric(x), as.numeric(y), method = method):
## Cannot compute exact p-value with ties
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in cor.test.default(as.numeric(x), as.numeric(y), method = method):
## Cannot compute exact p-value with ties
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in cor.test.default(as.numeric(x), as.numeric(y), method = method):
## Cannot compute exact p-value with ties
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in cor.test.default(as.numeric(x), as.numeric(y), method = method):
## Cannot compute exact p-value with ties
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in cor.test.default(as.numeric(x), as.numeric(y), method = method):
## Cannot compute exact p-value with ties
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in cor.test.default(as.numeric(x), as.numeric(y), method = method):
## Cannot compute exact p-value with ties
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in cor.test.default(as.numeric(x), as.numeric(y), method = method):
## Cannot compute exact p-value with ties
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in cor.test.default(as.numeric(x), as.numeric(y), method = method):
## Cannot compute exact p-value with ties
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in cor.test.default(as.numeric(x), as.numeric(y), method = method):
## Cannot compute exact p-value with ties
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in cor.test.default(as.numeric(x), as.numeric(y), method = method):
## Cannot compute exact p-value with ties
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in cor.test.default(as.numeric(x), as.numeric(y), method = method):
## Cannot compute exact p-value with ties
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in cor.test.default(as.numeric(x), as.numeric(y), method = method):
## Cannot compute exact p-value with ties
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in cor.test.default(as.numeric(x), as.numeric(y), method = method):
## Cannot compute exact p-value with ties
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in cor.test.default(as.numeric(x), as.numeric(y), method = method):
## Cannot compute exact p-value with ties
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in cor.test.default(as.numeric(x), as.numeric(y), method = method):
## Cannot compute exact p-value with ties
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in cor.test.default(as.numeric(x), as.numeric(y), method = method):
## Cannot compute exact p-value with ties
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in cor.test.default(as.numeric(x), as.numeric(y), method = method):
## Cannot compute exact p-value with ties
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in cor.test.default(as.numeric(x), as.numeric(y), method = method):
## Cannot compute exact p-value with ties
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in cor.test.default(as.numeric(x), as.numeric(y), method = method):
## Cannot compute exact p-value with ties
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in cor.test.default(as.numeric(x), as.numeric(y), method = method):
## Cannot compute exact p-value with ties
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in cor.test.default(as.numeric(x), as.numeric(y), method = method):
## Cannot compute exact p-value with ties
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in cor.test.default(as.numeric(x), as.numeric(y), method = method):
## Cannot compute exact p-value with ties
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in cor.test.default(as.numeric(x), as.numeric(y), method = method):
## Cannot compute exact p-value with ties
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in cor.test.default(as.numeric(x), as.numeric(y), method = method):
## Cannot compute exact p-value with ties
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in cor.test.default(as.numeric(x), as.numeric(y), method = method):
## Cannot compute exact p-value with ties
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in cor.test.default(as.numeric(x), as.numeric(y), method = method):
## Cannot compute exact p-value with ties
## Warning in par(usr): argument 1 does not name a graphical parameter
## Warning in cor.test.default(as.numeric(x), as.numeric(y), method = method):
## Cannot compute exact p-value with ties
## Warning in par(usr): argument 1 does not name a graphical parameter
I got plenty of warnings (the same two repeated for every panel of the
plot), and the plot still looks a bit messy, but it now shows the
histograms and the Spearman correlation coefficients, which makes it
easier to inspect the distributions and correlations. The distributions
of crim, zn, age, dis, ptratio, black, and lstat are heavily skewed.
indus, rad, and tax have two peaks. chas is a binary variable encoded as
0 and 1. nox, rm, and medv are approximately normally distributed. The
Spearman correlation coefficients indicate several statistically
significant positive and negative correlations between the variables.
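As a compact alternative to the pairwise plot, the Spearman correlation matrix can also be computed directly with cor() (a sketch, not the course code; ties are handled via average ranks, which is why cor.test() warned about exact p-values):

```r
library(MASS)  # provides the Boston data set

# Spearman correlations for all 14 variables at once
cor_spearman <- cor(Boston, method = "spearman")
round(cor_spearman[1:4, 1:4], 2)  # peek at one corner of the matrix
```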
I draw a correlation plot to visualize the correlations:
library(corrplot)
cor_matrix <- cor(Boston)
corrplot(cor_matrix, method = "circle")
The majority of the variables are inter-correlated. Only
chas seems to have very weak, if any, correlation to other
variables.
I’m performing the scaling as in the exercise, with function
scale()
boston_scaled <- scale(Boston)
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
class(boston_scaled)
## [1] "matrix" "array"
The means are now zero for all variables. The scaled data is in the form
of a matrix, so I convert it into a data frame:
boston_scaled <- as.data.frame(boston_scaled)
First, I’m checking out the quantiles:
boston_scaled$crim <- as.numeric(boston_scaled$crim)
summary(boston_scaled$crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419367 -0.410563 -0.390280 0.000000 0.007389 9.924110
Next, I use the quantiles to create the categories for the crime rate:
bins <- quantile(boston_scaled$crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE)
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
It seems to work! I have dropped the old crim variable and added a new
crime variable with crime rate categories based on the quantiles.
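The quantile bins above give the categories interval labels such as (-0.419,-0.41]; a variant (an assumption on my part, not the exercise code) passes labels= to cut() so the categories get the readable names used later in the analysis:

```r
library(MASS)

# Hypothetical variant: name the quantile-based crime rate categories directly.
boston_scaled <- as.data.frame(scale(Boston))
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))
table(crime)  # roughly a quarter of the suburbs falls in each class
```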
Now, I'm making the train and test sets. I randomly pick 80 % of the rows into the train set. I also save the correct classes and remove the crime variable from the test data. Here, I'm closely following the Exercise 4 code.
boston_scaled <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/boston_scaled.txt",
sep=",", header = T)
boston_scaled$crime <- factor(boston_scaled$crime, levels = c("low", "med_low", "med_high", "high"))
nrow(boston_scaled)
## [1] 506
n <- nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
correct_classes <- test$crime
test <- dplyr::select(test, -crime)
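Note that sample() without a seed gives a different split on every run; a reproducible variant of the split would be (the seed value is my own assumption, the original run used none):

```r
set.seed(42)  # assumption: any fixed seed makes the split reproducible
n <- 506      # number of rows in boston_scaled
ind <- sample(n, size = floor(n * 0.8))
length(ind)   # 404 rows go to the train set, the remaining 102 to the test set
```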
Now, the data is prepared for LDA.
With LDA, it is possible to find linear combinations of the variables that separate the target variable classes, i.e. we should be able to distinguish the variables which predict certain crime rate classes in our data. This is all new to me, so I follow Exercise 4 in my code writing.
lda.fit <- lda(crime ~ ., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2648515 0.2450495 0.2252475 0.2648515
##
## Group means:
## zn indus chas nox rm age
## low 0.9571583 -0.90326041 -0.12514775 -0.8562584 0.4273890 -0.8857497
## med_low -0.0820729 -0.29131093 0.04582045 -0.5713018 -0.1523137 -0.3121985
## med_high -0.3628494 0.09399949 0.11705447 0.3195630 0.1723583 0.3262420
## high -0.4872402 1.01499462 -0.01476176 1.0160270 -0.3864308 0.7978230
## dis rad tax ptratio black lstat
## low 0.8246790 -0.6867044 -0.7320485 -0.40596776 0.3729334 -0.75758753
## med_low 0.3397989 -0.5433656 -0.4832720 -0.03684897 0.3099670 -0.11672891
## med_high -0.3252093 -0.4114238 -0.3515436 -0.29771862 0.1112725 -0.04110072
## high -0.8354580 1.6596029 1.5294129 0.80577843 -0.9038364 0.85026527
## medv
## low 0.50939411
## med_low -0.02586214
## med_high 0.28140084
## high -0.68002945
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.12952798 0.82879811 -0.81024862
## indus 0.08797313 -0.29711773 0.36219028
## chas 0.01544735 0.03613783 0.22016226
## nox 0.26759110 -0.83693240 -1.55339988
## rm 0.02479918 -0.04179996 -0.14445856
## age 0.24773144 -0.28567414 0.11680868
## dis -0.08751735 -0.50882034 0.17098881
## rad 3.65024580 0.90183378 -0.02903152
## tax 0.06665855 -0.03327300 0.49559750
## ptratio 0.13602926 0.09639190 -0.29226525
## black -0.13132499 0.02313490 0.11136560
## lstat 0.17904449 -0.36189141 0.20577230
## medv 0.08232271 -0.59289072 -0.38355037
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9627 0.0271 0.0102
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  graphics::arrows(x0 = 0, y0 = 0,
                   x1 = myscale * heads[,choices[1]],
                   y1 = myscale * heads[,choices[2]], col = color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads),
       cex = tex, col = color, pos = 3)
}
classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2)
lda.arrows(lda.fit, myscale = 1)
It seems that rad has a very big impact on the crime rate, i.e. the
accessibility of a suburb to radial highways might be the key
explanatory variable behind the medium high and high crime rates of the
suburbs in question.
Now, I check how good my model is using a cross-tabulation:
lda.pred <- predict(lda.fit, newdata = test)
lda.pred
## $class
## [1] low low med_low med_high med_low med_low low med_low
## [9] med_low med_low med_low med_low med_low low med_low med_low
## [17] med_low med_low low med_low med_low low low low
## [25] med_low low med_low med_high med_high med_high med_high med_high
## [33] med_high med_high med_high med_high med_high med_high med_high med_high
## [41] med_high med_high med_high med_high med_high med_low low low
## [49] med_low med_low med_low med_low med_high low med_low low
## [57] med_high med_high low med_high low low low med_low
## [65] med_low low med_low med_low med_low med_low med_low low
## [73] med_low med_low med_low low high high high high
## [81] high high high high high high high high
## [89] high high high high high high high high
## [97] high med_low med_high med_high med_high med_high
## Levels: low med_low med_high high
##
## $posterior
## low med_low med_high high
## 1 8.313012e-01 1.481303e-01 2.056851e-02 5.207350e-28
## 4 5.094802e-01 4.017598e-01 8.875999e-02 1.312005e-25
## 5 3.621971e-01 4.513805e-01 1.864224e-01 6.274690e-25
## 8 1.735376e-02 3.769785e-01 6.056678e-01 5.456819e-19
## 11 3.446005e-02 6.753652e-01 2.901747e-01 3.230881e-19
## 13 1.771631e-01 6.535034e-01 1.693334e-01 4.104635e-21
## 17 4.954214e-01 3.638900e-01 1.406886e-01 1.912679e-22
## 18 1.318022e-01 5.349782e-01 3.332195e-01 5.663858e-20
## 20 2.545722e-01 5.285971e-01 2.168307e-01 1.231969e-20
## 23 8.606350e-02 5.535745e-01 3.603620e-01 2.474700e-19
## 28 8.590920e-02 5.224227e-01 3.916681e-01 3.688229e-19
## 32 1.229080e-01 5.660650e-01 3.110270e-01 1.656261e-19
## 34 9.446774e-02 6.063789e-01 2.991533e-01 4.100373e-19
## 42 6.065143e-01 3.804990e-01 1.298665e-02 2.513709e-27
## 51 3.647639e-01 6.133318e-01 2.190437e-02 1.188734e-23
## 52 3.702781e-01 6.029162e-01 2.680569e-02 2.342704e-23
## 59 3.862113e-01 4.845008e-01 1.292879e-01 9.354222e-17
## 65 3.565797e-01 5.352051e-01 1.082152e-01 3.342712e-25
## 66 9.876110e-01 1.227930e-02 1.097337e-04 2.412265e-25
## 69 2.532874e-01 7.394785e-01 7.234110e-03 5.416486e-24
## 77 4.939391e-02 9.075758e-01 4.303033e-02 5.469965e-20
## 84 6.859428e-01 3.035502e-01 1.050700e-02 2.833810e-23
## 92 5.128592e-01 4.540423e-01 3.309854e-02 1.278693e-25
## 96 5.362262e-01 4.413454e-01 2.242846e-02 2.237376e-26
## 97 3.912698e-01 5.938170e-01 1.491322e-02 4.862396e-26
## 98 4.390727e-01 4.324926e-01 1.284347e-01 1.682502e-25
## 117 6.829366e-02 5.670018e-01 3.647045e-01 2.978303e-17
## 126 5.584977e-02 4.006759e-01 5.434744e-01 3.390309e-22
## 130 1.702076e-02 2.852386e-01 6.977406e-01 1.449926e-17
## 131 1.569135e-02 1.696113e-01 8.146974e-01 8.429141e-18
## 135 9.290164e-03 1.737215e-01 8.169883e-01 6.535724e-17
## 140 1.071307e-02 1.802913e-01 8.089957e-01 2.525045e-17
## 141 8.938279e-03 2.172788e-01 7.737829e-01 5.649555e-17
## 143 9.567890e-05 4.202564e-03 9.957018e-01 1.378917e-15
## 144 5.884280e-05 1.140785e-03 9.988004e-01 4.257305e-16
## 149 3.588402e-05 8.347555e-04 9.991294e-01 5.157836e-16
## 151 1.268846e-04 7.650970e-04 9.991080e-01 5.826437e-17
## 154 9.793488e-05 7.335431e-04 9.991685e-01 2.846768e-16
## 157 1.647121e-04 1.126997e-03 9.987083e-01 1.659058e-15
## 159 2.062790e-02 3.131163e-01 6.662558e-01 3.741054e-18
## 161 1.684662e-02 4.877760e-01 4.953773e-01 4.642085e-18
## 164 1.032916e-03 3.450719e-02 9.644599e-01 2.443552e-18
## 166 9.633253e-03 2.090478e-01 7.813190e-01 1.231869e-17
## 171 1.232523e-02 3.910929e-01 5.965819e-01 1.252522e-17
## 172 1.474174e-02 3.987403e-01 5.865179e-01 5.229707e-18
## 184 3.691674e-01 4.133139e-01 2.175187e-01 5.244272e-23
## 200 9.950392e-01 4.756224e-03 2.045813e-04 2.655903e-26
## 204 9.941466e-01 3.735412e-03 2.117944e-03 2.055611e-23
## 208 6.379516e-02 7.493601e-01 1.868448e-01 5.342530e-21
## 216 2.684776e-01 6.355077e-01 9.601469e-02 1.347186e-22
## 221 2.890544e-02 6.073879e-01 3.637067e-01 6.614350e-14
## 222 9.076703e-03 6.746215e-01 3.163018e-01 3.650707e-13
## 224 4.192583e-02 3.301647e-01 6.279095e-01 1.172438e-14
## 240 5.351570e-01 4.400119e-01 2.483112e-02 2.392729e-20
## 242 2.936538e-01 6.679882e-01 3.835794e-02 2.725788e-19
## 252 5.441882e-01 4.238978e-01 3.191400e-02 9.797522e-20
## 259 3.695770e-02 2.196512e-02 9.410772e-01 1.824973e-18
## 262 1.913277e-02 1.056245e-02 9.703048e-01 6.227887e-19
## 275 8.445970e-01 1.492673e-01 6.135627e-03 3.436344e-23
## 281 3.373085e-01 3.090231e-01 3.536684e-01 4.155114e-21
## 285 9.947957e-01 5.102113e-03 1.021427e-04 4.970875e-30
## 286 9.433840e-01 5.641779e-02 1.982456e-04 1.332707e-30
## 289 7.437957e-01 2.475828e-01 8.621459e-03 2.415697e-20
## 295 8.371422e-02 8.965800e-01 1.970574e-02 2.971578e-24
## 297 7.671551e-02 8.762147e-01 4.706977e-02 5.667533e-24
## 304 6.414992e-01 3.101944e-01 4.830636e-02 4.306144e-19
## 313 1.385477e-01 5.776184e-01 2.838339e-01 3.179032e-20
## 317 6.421486e-02 5.962866e-01 3.394986e-01 3.503129e-20
## 320 1.697170e-01 5.656439e-01 2.646390e-01 2.224595e-21
## 322 2.773218e-01 5.809938e-01 1.416844e-01 1.140040e-20
## 323 2.875059e-01 6.158623e-01 9.663177e-02 6.090847e-21
## 326 4.627774e-01 4.588678e-01 7.835475e-02 2.317477e-22
## 328 1.700215e-01 6.625763e-01 1.674023e-01 9.743284e-21
## 336 3.854558e-01 4.529197e-01 1.616246e-01 1.827862e-21
## 346 1.554657e-01 8.258576e-01 1.867673e-02 8.386959e-26
## 348 9.834460e-01 1.588460e-02 6.694177e-04 5.928719e-24
## 361 5.481843e-23 4.398028e-20 2.307415e-15 1.000000e+00
## 365 1.245657e-22 1.253387e-19 7.811642e-16 1.000000e+00
## 367 1.736149e-23 3.427352e-20 3.339222e-16 1.000000e+00
## 374 3.538829e-25 6.123468e-21 1.436735e-17 1.000000e+00
## 376 1.029607e-22 1.573860e-19 3.519046e-16 1.000000e+00
## 379 9.580369e-24 4.304796e-20 9.107028e-17 1.000000e+00
## 380 1.674947e-23 6.461698e-20 8.883979e-17 1.000000e+00
## 384 3.835968e-24 1.982882e-20 6.601458e-17 1.000000e+00
## 385 6.376479e-25 5.315436e-21 1.287353e-17 1.000000e+00
## 391 3.462673e-23 9.317331e-20 3.298116e-16 1.000000e+00
## 394 1.138965e-22 2.101666e-19 5.181474e-16 1.000000e+00
## 404 5.107236e-23 1.539347e-19 1.915298e-16 1.000000e+00
## 408 2.365632e-23 6.004967e-20 3.316901e-16 1.000000e+00
## 439 3.594566e-27 2.856379e-23 6.496147e-19 1.000000e+00
## 440 4.348967e-24 1.437418e-20 1.220856e-16 1.000000e+00
## 446 1.503693e-26 5.429043e-23 1.676447e-18 1.000000e+00
## 449 1.707643e-23 4.921938e-20 2.749972e-16 1.000000e+00
## 463 1.562776e-22 2.714785e-19 2.112985e-15 1.000000e+00
## 473 4.461231e-21 1.617430e-17 7.467259e-15 1.000000e+00
## 474 2.451298e-21 4.697262e-18 1.053083e-14 1.000000e+00
## 476 1.943789e-23 2.009911e-19 7.452385e-17 1.000000e+00
## 489 5.777338e-03 5.809576e-01 4.132651e-01 1.871125e-17
## 493 8.057371e-03 4.718570e-01 5.200856e-01 4.974841e-18
## 495 1.026422e-01 2.966730e-01 6.006848e-01 2.201760e-17
## 496 9.463448e-02 3.413757e-01 5.639898e-01 1.294843e-17
## 500 6.908531e-02 4.015657e-01 5.293490e-01 1.905584e-16
##
## $x
## LD1 LD2 LD3
## 1 -4.3270698 -0.10257671 -0.68724487
## 4 -3.7921620 -0.59366636 -0.14048382
## 5 -3.6312560 -0.98882555 -0.18139274
## 8 -2.0787036 -1.68796470 0.67051157
## 11 -2.1731878 -1.06459500 1.18979443
## 13 -2.6958298 -0.45009608 0.56944768
## 17 -3.0411265 -0.15432759 -0.41409248
## 18 -2.4151660 -0.68023176 0.23903040
## 20 -2.5974131 -0.30154371 0.08325740
## 23 -2.2437584 -0.77726364 0.45145942
## 28 -2.2019124 -0.78149830 0.36241723
## 32 -2.3010799 -0.57733930 0.35729681
## 34 -2.1956218 -0.59421506 0.57082188
## 42 -4.1718962 0.10706695 0.55186246
## 51 -3.2960364 0.38634956 1.03104954
## 52 -3.2297620 0.35348326 0.92044762
## 59 -1.6790660 1.00040119 0.02255621
## 65 -3.6911568 -0.78039132 0.22080367
## 66 -3.5125612 3.34083210 -0.82732646
## 69 -3.3444453 0.71136160 1.86805482
## 77 -2.3486684 -0.09274069 2.11132359
## 84 -3.1990235 1.15950224 0.37669256
## 92 -3.7803771 -0.08810323 0.39788162
## 96 -3.9548038 -0.03020767 0.51665013
## 97 -3.8610200 0.09967774 1.12896334
## 98 -3.7686310 -0.83129303 -0.15709112
## 117 -1.7367390 -0.43833334 0.58743113
## 126 -2.9023464 -1.79679031 0.18424758
## 130 -1.7308636 -1.45124231 0.36048770
## 131 -1.7680645 -1.60166113 -0.15164285
## 135 -1.5305132 -1.64999406 0.13337338
## 140 -1.6372882 -1.67080764 0.10031372
## 141 -1.5503718 -1.66094179 0.38509040
## 143 -0.8616185 -3.42453351 -1.13311487
## 144 -0.9142279 -3.71402636 -2.10922895
## 149 -0.8589027 -3.91101649 -2.15309418
## 151 -1.1448928 -3.53746136 -2.86860552
## 154 -0.9660949 -3.50468521 -2.77723534
## 157 -0.8241270 -3.11590044 -2.63547614
## 159 -1.8832009 -1.47059007 0.37044003
## 161 -1.8603386 -1.40480805 1.01438171
## 164 -1.7090407 -2.99194630 -0.34917636
## 166 -1.7108285 -1.77399306 0.30638745
## 171 -1.7378261 -1.54220581 0.88472728
## 172 -1.8376232 -1.53472976 0.81994620
## 184 -3.1733812 -0.63968888 -0.33738038
## 200 -3.7208788 2.84725209 -1.98800207
## 204 -3.0680549 2.27553656 -3.22191553
## 208 -2.6242149 -0.94659784 1.16655270
## 216 -3.0588128 -0.28865675 0.57887132
## 221 -0.8974741 -0.10529954 1.08834343
## 222 -0.6635970 -0.40415238 1.82831399
## 224 -1.0848180 -0.35972393 0.09513179
## 240 -2.5175559 1.22466644 0.47914062
## 242 -2.2582520 0.94146090 0.98271721
## 252 -2.3758146 1.23688917 0.32802785
## 259 -1.9015552 -1.35966080 -2.55300692
## 262 -1.9554782 -1.75201398 -2.92069934
## 275 -3.1545327 1.57449168 -0.15715056
## 281 -2.7149855 -0.50977512 -0.77167595
## 285 -4.5996341 2.39440012 -1.62654099
## 286 -4.8292148 1.82170416 0.35589838
## 289 -2.4927341 1.94138888 0.23561114
## 295 -3.3774458 -0.37399934 2.16806743
## 297 -3.3221500 -0.80161398 1.81343871
## 304 -2.2277549 1.24816153 -0.22474371
## 313 -2.4770759 -0.63171415 0.35479232
## 317 -2.4332282 -1.06682471 0.69225734
## 320 -2.7607981 -0.75292512 0.26228752
## 322 -2.6047846 -0.05341351 0.31296650
## 323 -2.6661995 0.09949941 0.51477332
## 326 -3.0145626 0.12684769 0.09036152
## 328 -2.6044282 -0.38195324 0.60843367
## 336 -2.8048742 -0.13518403 -0.14271060
## 346 -3.7742306 -0.39735463 1.80225638
## 348 -3.2241975 2.69844044 -1.36649316
## 361 6.7787175 -0.62801201 -1.97313361
## 365 6.7224966 0.26959731 -0.93713483
## 367 6.8818883 -0.14620157 -0.79226756
## 374 7.1960520 -0.23918396 0.91341294
## 376 6.7394923 0.58652101 -0.28353569
## 379 6.9286861 0.24545190 0.28149715
## 380 6.8872869 0.49916872 0.39159298
## 384 7.0072860 0.02055750 0.15569141
## 385 7.1737021 0.09088645 0.53380859
## 391 6.8131706 0.14088105 -0.19895048
## 394 6.7170176 0.42297551 -0.23171986
## 404 6.7869284 0.58148235 0.31065702
## 408 6.8472261 -0.02080565 -0.42071141
## 439 7.6698895 -0.55080706 -0.45753051
## 440 7.0002941 -0.22854679 -0.47403337
## 446 7.5581033 -0.40937189 -0.98572690
## 449 6.8739543 -0.06591827 -0.36189222
## 463 6.6653256 -0.16698898 -0.76028281
## 473 6.3327600 0.57032574 0.82999716
## 474 6.3986917 0.16203957 -0.17350705
## 476 6.8440583 0.61895866 1.45279276
## 489 -1.6650440 -1.67401659 1.79324325
## 493 -1.8161868 -1.75868158 1.33191194
## 495 -1.7756056 -0.51636998 -0.43796712
## 496 -1.8302165 -0.57570797 -0.23921487
## 500 -1.5401905 -0.43867870 0.09919169
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 12 7 1 0
## med_low 7 15 5 0
## med_high 1 12 20 2
## high 0 0 1 19
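The printed counts can be condensed into accuracies. This sketch rebuilds the confusion matrix from the table above rather than from the random split, so the exact numbers depend on the sampling:

```r
# Rebuild the printed confusion matrix and summarize it.
conf <- matrix(c(12,  7,  1,  0,
                  7, 15,  5,  0,
                  1, 12, 20,  2,
                  0,  0,  1, 19),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))

accuracy <- sum(diag(conf)) / sum(conf)       # share of correct predictions
round(accuracy, 2)
round(diag(conf) / rowSums(conf), 2)          # per-class accuracy
```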
Interpretation and evaluation of the LDA
The model predicts the high crime rate category very well and the medium
high category reasonably, whereas the remaining categories (medium low
and low crime rates) are predicted less accurately. The model might be
useful in detecting areas at high risk of a high crime rate, but it
shouldn't be used the other way round, i.e. to detect low crime rate
suburbs.
First, I study Euclidean and Manhattan distances. To avoid messing up my
previous work, I scale the data again and use boston_scaled2 in this
part.
data("Boston")
boston_scaled2 <- scale(Boston)
# dist() computes the matrix of pairwise distances between observations;
# the default method is "euclidean"
dist_eu <- dist(boston_scaled2)
dist_man <- dist(boston_scaled2, method = "manhattan")
Next, I perform a k-means clustering with the seed set at 13 and 4 centers:
set.seed(13)
km <- kmeans(boston_scaled2, centers = 4)
# plot the variable pairs, colored by the 4 k-means clusters
pairs(boston_scaled2, col = km$cluster)
Next, I verify whether this number of centers was a good choice and visualize the result, as in Exercise 4:
set.seed(123)
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled2, k)$tot.withinss})
# visualize the results
ggplot(data.frame(k = 1:k_max, twcss = twcss), aes(x = k, y = twcss)) + geom_line()
Based on the elbow in the curve, 5 centers seems like a reasonable choice.
# k-means clustering
km <- kmeans(boston_scaled2, centers = 5)
# plot the data set with clusters
pairs(boston_scaled2, col = km$cluster)
Based on the plot above, several variables contribute to the clustering. For simplicity, I pick two continuous variables, lstat and medv, which seem to separate the clusters nicely, for closer inspection and plotting against the clusters:
plot_data <- as.data.frame(boston_scaled2)
plot_data$cluster <- factor(km$cluster)
ggplot(plot_data, aes(x = lstat, y = medv)) +
geom_point(aes(color = cluster)) +
labs(color = 'Cluster')
As a reminder, the variables stand for:
lstat: lower status of the population (percent).
medv: median value of owner-occupied homes in $1000s.
The clusters are not very clearly separated by these two variables, but some features of each cluster can be observed. Cluster 1 is “the average neighborhood”, characterized by roughly mean values in both variables: the value of owner-occupied homes and the percentage of lower status population. Cluster 2 differs from cluster 1 in having higher-valued (or just more) owner-occupied homes and a lower percentage of lower status population. Cluster 3 overlaps with the other clusters, but leans towards a higher percentage of lower status population and lower-valued owner-occupied homes. Cluster 4 overlaps with cluster 3 but shows an even more pronounced pattern of a high share of lower status population and low-valued owner-occupied homes. Cluster 5 is best described as Kauniainen, i.e. it has generally high-valued owner-occupied homes and a low percentage of lower status residents, though a few are still present (most likely living close to the railway station). ;)
The variables appear to be correlated, which interferes with the results. For example, lstat and medv seem to be negatively correlated: the higher the value of owner-occupied homes, the lower the percentage of lower status population in the area.
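This suspicion is easy to check directly from the data (a quick sanity check; Boston here is the original, unscaled data set from the MASS package):

```r
library(MASS)
# Pearson correlation between % lower status population and median home value
cor(Boston$lstat, Boston$medv)
```

The result is clearly negative, which supports the visual impression from the scatter plot.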
The cluster separation could be more distinct; as it is, the clusters overlap more than I would like. I may have misinterpreted the elbow plot, and the true number of clusters may be smaller than 5. The data should also be cleaned of outliers: I suspect that the points at the maximum value of medv are actually outliers.
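One quick way to probe this suspicion is to count how many observations sit exactly at the maximum of medv; a pile-up at a single value would suggest the variable is capped rather than genuinely observed there (a sketch on the unscaled Boston data from MASS):

```r
library(MASS)
# how many towns are recorded at exactly the maximum medv value?
sum(Boston$medv == max(Boston$medv))
```

If more than a handful of towns share the exact maximum, those points are likely censored values rather than ordinary observations.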
I give this a shot in the dark, copy-pasting from the previous code chunks:
boston_scaled3 <- scale(Boston)
kmeans_bonus <- kmeans(boston_scaled3, centers = 5)
Boston$cluster <- factor(kmeans_bonus$cluster)
lda_fit_bonus <- lda(cluster ~ ., data = Boston)
Now, testing visualization:
plot(lda_fit_bonus)
And then, checking out the coefficients:
lda_fit_bonus$scaling
## LD1 LD2 LD3 LD4
## crim 0.0080852882 -2.134639e-02 -1.652315e-02 0.005530265
## zn -0.0031069655 -6.303245e-02 3.040976e-02 -0.051703569
## indus 0.0253157070 5.494003e-02 6.928201e-02 -0.046050129
## chas -0.7677964641 4.193077e-01 -1.210915e+00 -0.858103945
## nox -0.4146107663 7.748986e-01 1.300230e+00 -3.497642548
## rm -0.0100449764 -2.071640e-01 -8.460646e-01 -0.466827093
## age 0.0018135788 1.770948e-02 -7.796642e-03 -0.025090244
## dis -0.1329179998 -1.605232e-01 1.088635e-01 0.151761364
## rad 0.5010233181 -1.243454e-01 -9.880000e-02 0.170930126
## tax -0.0009979140 -2.646755e-03 2.713772e-03 -0.005439178
## ptratio 0.0267862421 8.509904e-02 1.383573e-01 -0.126703143
## black -0.0004856502 -3.039818e-05 -6.642119e-05 0.001016834
## lstat 0.0170738701 8.619402e-03 -7.189688e-03 -0.054987522
## medv -0.0193304059 -2.865138e-02 -7.933669e-02 -0.046756580
Judging by the absolute sizes of the LD1 coefficients, chas, rad, and
nox have the strongest influence on the separation between clusters.
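This can also be read off programmatically by ranking the predictors by the absolute value of their LD1 coefficient (note that the predictors are unscaled here, so the coefficients also reflect the variables' units; lda_fit_bonus comes from the chunk above):

```r
# rank predictors by the magnitude of their contribution to LD1
sort(abs(lda_fit_bonus$scaling[, "LD1"]), decreasing = TRUE)
```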
model_predictors <- Boston[, -which(names(Boston) == "cluster")]
dim(model_predictors)
## [1] 506 14
dim(lda_fit_bonus$scaling)
## [1] 14  4
Here, I take advantage of the Bonus exercise:
matrix_product <- as.matrix(model_predictors) %*% lda_fit_bonus$scaling
matrix_product <- as.data.frame(matrix_product)
Getting plotly:
#install.packages("plotly")
library(plotly)
## Warning: package 'plotly' was built under R version 4.3.2
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
Drawing a 3D plot, colored by the crime categories:
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = boston_scaled$crime)
Here, too, it is clear that the class “high” stands apart from the other classes.
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = km$cluster)
It seems that the crime-rate classes and the k-means clusters capture much the same socioeconomic structure!
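The visual impression can be quantified with a cross-tabulation (a sketch, assuming boston_scaled$crime and km from the earlier chunks cover the same rows in the same order): if the two groupings agree, each crime class should concentrate in only a few clusters.

```r
# cross-tabulate the crime categories against the k-means clusters;
# strong agreement shows up as most counts landing in a few cells per row
table(crime = boston_scaled$crime, cluster = km$cluster)
```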